This practical session illustrates elements of Microbial Metagenomics data analysis, and a Microbial Biomarker development. Specifically, it includes:
It is expected that you already have some general experience with R and R markdown (including ggplot and dplyr), and some experience with machine learning in R (including caret package and random forest analysis).
The dataset is provided by the RestREco project, which is currently running in Cranfield under the lead of Prof. Jim Harris: https://restreco.com/
The RestREco project studies ecosystems after prior industrial or agricultural use. The main idea behind the project is that we will never be able to restore ecosystems to their prior state (e.g. because of the climate change, population increase or the technology developments). So, we need to develop some new criteria to manage ecosystem restoration: for instance restore to some sufficiently complex and resilient state (so called “restoration to complexity”).
Soil microbiome is just one of the many aspects considered by the project. Soil samples were collected from tens of sites in Scotland and in England, which had been previously exposed to intensive Industrial (quarrying) or Agricultural use. The dataset selected for this practical includes only woodland; additionally including a small number of ancient woodland sites. The soil microbial community was assessed using 16S-amplicon sequencing. The code provided in to you below is focusing on the effect of the Prior Land Use: Industrial vs Agricultural sites. You are expected to follow this example, but modify the code to compare the Regions: Scotland vs England.
The practical session aims to give a very broad overview of popular microbial metagenomics tools in R. So, it may be challenging to complete this practical if you are new to the microbial analysis and/or if your prior R and ML experience is limited. So don’t be discouraged if you don’t complete everything within the given time! Work with your own pace: you may keep this script and the data for a reference to use later if/when you need.
Good Luck & Happy Coding!
These packages should be installed once, hence the installation commands are commented.
It would be a good idea to install these packages in advance, so you may focus on the actual data analysis during the practical session.
# CRAN
# install.packages("vegan")
# install.packages("Polychrome")
# install.packages(dendextend)
# install.packages("ggplotify")
# install.packages("parallelMap")
# install.packages("caret")
# install.packages("randomForest")
# Bioconductor
# install.packages("BiocManager")
# BiocManager::install("phyloseq")
# BiocManager::install("ANCOMBC")
# BiocManager::install("mixOmics")
# GitHub
# install.packages("devtools")
# devtools::install_github("jbisanz/qiime2R")
date()
## [1] "Tue Mar 19 20:12:10 2024"
rm(list=ls())
graphics.off()
These are the generic packages used in may sections of the code. The remaining packages will be loaded when needed.
library(dplyr)
library(ggplot2)
The source data for this analysis was prepared in QIIME2 (see more details about QIIME2 and the data preparation in the lecture). The QIIME2 data format is called “QIIME2 artifacts”, and has the QZA file extension: for QIIME2 Zip Archive. As you may see from the file extension, it’s just a Ziped archive containing some data (such as counts matrix in BIOM format, or sequences in FASTQ format, or phylogenetic tree, or taxa names, etc) together with the technical information about how the data was generated (so called “provenance information”). You may unzip QIIME2 artifacts to see their content (optional!).
One of the most popular data formats for ecology metagenomics data in R was developed developed in Phyloseq R package (more details in the lecture). It provides a container for joint storage of feature/counts table, taxonomy, phylogeny, samples metadata, and may even keep the representative sequences for the OTUs/ASVs present in the dataset.
In this practical we will import QZA files to R using qza_to_phyloseq() function from qiime2R package. We will not use the representative sequences in our analysis, so they will not be added to the Phyloseq object in this example.
An alternative way of importing QZA to Phyloseq could be
However, the Phyloseq and QIIME2 often assume slightly different versions of BIOM specification, which may require some additional and tedious BIOM data tuning. On my experience, qiime2R package is the most robust way of importing QZA to Phyloseq (in 2024).
# Required library
library(qiime2R)
# Source files: assuming files in "data" sub-folder and
# using "\\" for Windows path separator
feature_table_qza <- "data\\feature_table.qza"
rooted_tree_qza <- "data\\rooted_tree.qza"
taxonomy_qza <- "data\\taxonomy.qza"
metadata_tsv <- "data\\samples.txt"
# Read data
data.phy <- qza_to_phyloseq(
features = feature_table_qza,
tree = rooted_tree_qza,
taxonomy = taxonomy_qza,
metadata = metadata_tsv
)
# Check result
data.phy
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 60355 taxa and 330 samples ]
## sample_data() Sample Data: [ 330 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 60355 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 60355 tips and 60236 internal nodes ]
# Cleanup
# Here I included removal of not needed objects, packages and cleaning the RAM
# Later I will opnly remuva not needed objects.
rm(feature_table_qza, rooted_tree_qza, taxonomy_qza, metadata_tsv) # remove unnecessary objects
detach("package:qiime2R", unload=TRUE) # unload qiime2R package
gc() # free unused R memory
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5229507 279.3 10184331 544.0 10184331 544.0
## Vcells 30170113 230.2 88785406 677.4 92120484 702.9
Phyloseq package provides many helpful functions to explore microbial data.
# Load phyloseq and check its version
library(phyloseq)
packageVersion("phyloseq")
## [1] '1.44.0'
# High-level content of the Phyloseq object
data.phy
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 60355 taxa and 330 samples ]
## sample_data() Sample Data: [ 330 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 60355 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 60355 tips and 60236 internal nodes ]
dim(sample_data(data.phy))
## [1] 330 11
sample_variables(data.phy)
## [1] "Novogene_ID" "Site_code" "Site_Code_Plot" "Site_name" "Plot_no" "Cranfield_code" "Region" "Former_landuse" "Woodland_age" "pH" "Conductivity"
head(sample_names(data.phy))
## [1] "WP001" "WP002" "WP003" "WP004" "WP005" "WP006"
sample_data(data.phy)[1:5,1:5]
## Novogene_ID Site_code Site_Code_Plot Site_name Plot_no
## WP001 FKDN230153473-1A SW01 SW01_P1 Blackridge_Farm 1
## WP002 FKDN230153474-1A SW01 SW01_P2 Blackridge_Farm 2
## WP003 FKDN230236203-1A SW01 SW01_P3 Blackridge_Farm 3
## WP004 FKDN230153476-1A SW01 SW01_P4 Blackridge_Farm 4
## WP005 FKDN230153477-1A SW01 SW01_P5 Blackridge_Farm 5
# How many sites per Land Use and Region?
table(sample_data(data.phy)[,c("Region","Former_landuse")])
## Former_landuse
## Region Agriculture Ancient_woodland Industrial Rewilding
## England 75 10 75 10
## Scotland 75 10 75 0
dim(otu_table(data.phy))
## [1] 60355 330
otu_table(data.phy)[1:10,1:5]
## OTU Table: [10 taxa and 5 samples]
## taxa are rows
## WP001 WP002 WP003 WP004 WP005
## 90dd28446d3cd23e7707d9d5c2fb4df8 0 0 0 0 0
## 78be05943c92bdd31e9a872ee909279c 0 0 0 0 0
## d6b5940830252894a33e69c8ff9c1a23 0 0 0 0 0
## d037d048afbc9e092d04fa76c860d23d 0 0 0 0 0
## 94456191991e1095be8219ca16e0413e 0 0 0 0 0
## 741481502fb4f4eeba9211d66e4bbda0 0 0 0 0 0
## 8142f3b14a5160a88689c0568dd975dc 0 0 0 0 0
## 16da85f532caa5d7a53f71f09a518c46 0 0 0 0 0
## d7f4617ac1639732ce98499f296f9c08 0 0 0 0 0
## 2b101b9e356663a2581709b5b0981f3e 0 0 0 0 0
# Extract to matrix, and check sparsity
otu_table.mx <- as.matrix(otu_table(data.phy))
sum(otu_table.mx==0)/(nrow(otu_table.mx)*ncol(otu_table.mx)) # sparsity: 98%
## [1] 0.981901
# Clean-up
rm(otu_table.mx)
# Ranks in the dataset
rank_names(data.phy)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
# Taxonomy table
dim(tax_table(data.phy))
## [1] 60355 7
tax_table(data.phy)[1:10,]
## Taxonomy Table: [10 taxa by 7 taxonomic ranks]:
## Kingdom Phylum Class Order Family Genus Species
## 90dd28446d3cd23e7707d9d5c2fb4df8 "d__Bacteria" "Patescibacteria" "Paceibacteria" "Paceibacterales" "Staskawiczbacteraceae" "GWA2-37-10" NA
## 78be05943c92bdd31e9a872ee909279c "Unassigned" NA NA NA NA NA NA
## d6b5940830252894a33e69c8ff9c1a23 "d__Bacteria" "Patescibacteria" NA NA NA NA NA
## d037d048afbc9e092d04fa76c860d23d "d__Bacteria" "Patescibacteria" NA NA NA NA NA
## 94456191991e1095be8219ca16e0413e "d__Bacteria" "Patescibacteria" NA NA NA NA NA
## 741481502fb4f4eeba9211d66e4bbda0 "d__Bacteria" "Patescibacteria" NA NA NA NA NA
## 8142f3b14a5160a88689c0568dd975dc "d__Bacteria" NA NA NA NA NA NA
## 16da85f532caa5d7a53f71f09a518c46 "d__Bacteria" "Patescibacteria" NA NA NA NA NA
## d7f4617ac1639732ce98499f296f9c08 "d__Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales_A_504705" "Xanthobacteraceae_503485" "Bradyrhizobium" NA
## 2b101b9e356663a2581709b5b0981f3e "Unassigned" NA NA NA NA NA NA
# Extract to data frame
taxa.df <- as.data.frame(tax_table(data.phy))
# What Kingdoms we have in the dataset (plain R)
table(taxa.df$Kingdom)
##
## d__Archaea d__Bacteria Unassigned
## 161 60173 21
# Number of different Phyla in the dataset (dplyr)
n_distinct(taxa.df$Phylum)
## [1] 67
# 7 most abundant Bacterial Phyla (dplyr)
taxa.df %>%
filter(Kingdom == "d__Bacteria") %>%
count(Phylum, sort=T) %>%
head(7)
## Phylum n
## 1 Proteobacteria 12460
## 2 Planctomycetota 9108
## 3 Actinobacteriota 6492
## 4 Acidobacteriota 5815
## 5 <NA> 5250
## 6 Chloroflexota 3550
## 7 Bacteroidota 3296
# Clean-up
rm(taxa.df)
# Make a small Phyloseq object (for a simple toy phylogenetic tree)
myTaxa <- taxa_names(data.phy)[200:210] # Select some taxa: you may select others if you wish :)
myTaxa.phy = prune_taxa(myTaxa, data.phy) # Make phyloseq object
myTaxa.phy # Show summary of the Phyloseq object
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 11 taxa and 330 samples ]
## sample_data() Sample Data: [ 330 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 11 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 11 tips and 10 internal nodes ]
# Plot a phylogenetic tree
plot_tree(myTaxa.phy, label.tips = "Family",
color = "Class", size = "abundance",
plot.margin = 0.5, ladderize = TRUE,
title="Example of a phylogeny tree")
# Clean-up
rm(myTaxa, myTaxa.phy)
After exploring the data, Phyloseq also provides masny functions to manipulate the data: filter, merge, agglomerate etc (by taxa or by samples).
Note that in Phyloseq Merging/ Agglomerating is not the same as Filtering/ Sub-setting/ Pruning.
In the chunk below we will agglomerate taxa. A similar merging operation exist for samples (it will be illustrated later, when we make taxonomic barplots).
# Reminder of a summary of the source data
data.phy
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 60355 taxa and 330 samples ]
## sample_data() Sample Data: [ 330 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 60355 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 60355 tips and 60236 internal nodes ]
# Keep only samples with Industrial or Agricultural prior use
table(sample_data(data.phy)$Former_landuse)
##
## Agriculture Ancient_woodland Industrial Rewilding
## 150 20 150 10
IndAgri.phy <- subset_samples(data.phy,
Former_landuse %in% c("Industrial","Agriculture"))
IndAgri.phy # Number of samples dropped from 330 to 300
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 60355 taxa and 300 samples ]
## sample_data() Sample Data: [ 300 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 60355 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 60355 tips and 60236 internal nodes ]
# Reminder: what kingdoms we have in the data?
table(as.data.frame(tax_table(data.phy))$Kingdom)
##
## d__Archaea d__Bacteria Unassigned
## 161 60173 21
# Keep only Bacteria (remove Archaea and Unclassified)
BacIndAgri.phy <- subset_taxa(IndAgri.phy, Kingdom == "d__Bacteria")
BacIndAgri.phy # The OTU count dropped from 60,355 to 60,173
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 60173 taxa and 300 samples ]
## sample_data() Sample Data: [ 300 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 60173 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 60173 tips and 60054 internal nodes ]
# Agglomerate Bacteria to Class level: to reduce the size of dataset for analysis
# (may take ~5min)
BacIndAgriClass.phy <- tax_glom(BacIndAgri.phy, taxrank="Class")
BacIndAgriClass.phy # Taxa count dropped to 156
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 156 taxa and 300 samples ]
## sample_data() Sample Data: [ 300 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 156 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 156 tips and 155 internal nodes ]
# Note that all ranks are preserved, but the data set to NA
rank_names(BacIndAgriClass.phy)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
head(tax_table(BacIndAgriClass.phy))
## Taxonomy Table: [6 taxa by 7 taxonomic ranks]:
## Kingdom Phylum Class Order Family Genus Species
## 7b973191300cc536ed0e864d139df732 "d__Bacteria" "Desulfobacterota_D" "UBA1144" NA NA NA NA
## 92e4bb6e35dc20054c1d2c8f3d3b0439 "d__Bacteria" "Desulfobacterota_I" "Desulfovibrionia" NA NA NA NA
## 2c372459c9d3749b757c83aa4ccae93c "d__Bacteria" "Desulfobacterota_I" "Syntrophobacteria" NA NA NA NA
## 93d4a957de027db558e39b26776cab7b "d__Bacteria" "Desulfobacterota_C" "Deferrisomatia" NA NA NA NA
## 20a5bdc4157cf62cae00ba7af03ef579 "d__Bacteria" "Verrucomicrobiota" "Kiritimatiellae_777934" NA NA NA NA
## 67cd63792e78e5c0624f90683f990e78 "d__Bacteria" "Verrucomicrobiota" "Verrucomicrobiae" NA NA NA NA
# Keep only taxa (classes) with counts >10 in at least 12 samples
# x in filter_taxa() stands for the rows of the feature table
# (rows = taxa, columns = samples)
BacIndAgriClassAbund.phy <- filter_taxa(BacIndAgriClass.phy,
function(x) sum(x >= 10) >= 12,
prune=TRUE)
BacIndAgriClassAbund.phy # Taxa count dropped to 76
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 76 taxa and 300 samples ]
## sample_data() Sample Data: [ 300 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 76 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 76 tips and 75 internal nodes ]
# Explore sparsity after agglomeration
otu_table.mx <- as.matrix(otu_table(BacIndAgriClassAbund.phy)) # extract matrix
sum(otu_table.mx==0)/(nrow(otu_table.mx)*ncol(otu_table.mx)) # sparsity: 34%
## [1] 0.3372807
# Clean-up
rm(data.phy, IndAgri.phy, BacIndAgri.phy, BacIndAgriClass.phy, otu_table.mx)
Phyloseq function transform_sample_counts() allows applying a function over samples. The function takes the x argument, which refers to a column within the feature table (remember: taxa= rows, samples=columns).
transform_sample_counts() may be used for normalization, such as scaling or centering per sample.
For simplicity, in this tutorial we will normalize by total count per sample. Then we scale the fractions by the median, and round to integers because some of the downstream functions expect integer counts, rather than fractions.
An alternative commonly used way of normalizing microbial data is rarefaction: equalizing depth in samples by random removal of data from the samples with higher depth. In R, the rarefaction curves could be plottred using vegan::rarecurve(), then the rarefaction may be applied using phyloseq::rarefy_even_depth().
It’s important to remember that after combining samples or filtering taxa the normalization should be repeated.
# Counts per sample before normalization
plot(colSums(otu_table(BacIndAgriClassAbund.phy)),
ylab="Count", xlab="Samples", xaxt="n",
main="Counts per sample before normalization")
# Calculate median count per sample before normalization (33,183)
countsPerSample <- colSums(otu_table(BacIndAgriClassAbund.phy))
medianCount <- median(countsPerSample)
# Normalize by total count per sample
BacIndAgriClassAbundNorm.phy = transform_sample_counts(BacIndAgriClassAbund.phy,
function(x) x / sum(x))
# Scale and round (to have integer counts instead of fractions)
BacIndAgriClassAbundNorm.phy = transform_sample_counts(BacIndAgriClassAbundNorm.phy,
function(x) round(x*medianCount))
# Counts per sample after normalization
plot(colSums(otu_table(BacIndAgriClassAbundNorm.phy)),
ylab="Count", xlab="Samples", xaxt="n",
main="Counts per sample after normalization",
ylim=c(min(countsPerSample),max(countsPerSample)))
# Clean-up
rm(BacIndAgriClassAbund.phy, countsPerSample, medianCount)
Visualizing taxonomy is an important part of exploring microbial data.
Phyloseq provides plot_bar() function that outputs a ggplot object that can be directly piped to the geom_bar() layer.
# Select 10 first samples (for making the toy plots clearer)
MySamples <- sample_names(BacIndAgriClassAbundNorm.phy)[1:10]
MyData.phy <- prune_samples(MySamples, BacIndAgriClassAbundNorm.phy)
MyData.phy
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 76 taxa and 10 samples ]
## sample_data() Sample Data: [ 10 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 76 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 76 tips and 75 internal nodes ]
# Make taxonomy barplot
plot_bar(MyData.phy, fill = "Phylum") +
geom_bar(aes(color=Phylum, fill=Phylum), stat="identity", position="stack") +
theme(legend.position = "bottom") +
ggtitle("Taxonomy (Phyla) per sample") +
guides(fill=guide_legend(ncol=3))
# Clean-up
rm(MySamples)
Combining low-abundant taxa may make the barplot clearer.
Unfortunately, plot_bar() does not provide such option. The code below illustrates how it can be done (modified from the Phyloseq author post https://github.com/joey711/phyloseq/issues/494 ).
It includes 3 steps:
# Convert Phyloseq object to large data frame
MeltData.df <- psmelt(MyData.phy)
MeltData.df[1:5,1:12]
## OTU Sample Abundance Novogene_ID Site_code Site_Code_Plot Site_name Plot_no Cranfield_code Region Former_landuse Woodland_age
## 733 f898c7a8c566d0beb46d5448dde2387c WP003 11758 FKDN230236203-1A SW01 SW01_P3 Blackridge_Farm 3 CU0003 Scotland Agriculture 18
## 739 f898c7a8c566d0beb46d5448dde2387c WP007 10093 FKDN230284296-1A SW02 SW02_P2 Cassafuir 2 CU0007 Scotland Agriculture 50
## 738 f898c7a8c566d0beb46d5448dde2387c WP009 9179 FKDN230153481-1A SW02 SW02_P4 Cassafuir 4 CU0009 Scotland Agriculture 50
## 732 f898c7a8c566d0beb46d5448dde2387c WP006 8565 FKDN230153478-1A SW02 SW02_P1 Cassafuir 1 CU0006 Scotland Agriculture 50
## 441 8768efd36962598377bd60d7c6d576db WP009 8492 FKDN230153481-1A SW02 SW02_P4 Cassafuir 4 CU0009 Scotland Agriculture 50
# Get counts per phylum
Phyla.df <- MeltData.df %>%
group_by(Phylum) %>%
summarise(Count=sum(Abundance))
Phyla.df
## # A tibble: 37 × 2
## Phylum Count
## <chr> <dbl>
## 1 Acidobacteriota 72806
## 2 Actinobacteriota 26413
## 3 Armatimonadota 86
## 4 Bacteroidota 9683
## 5 Bdellovibrionota_E 457
## 6 Chlamydiota 432
## 7 Chloroflexota 5695
## 8 Cyanobacteria 67
## 9 Dependentiae 134
## 10 Desulfobacterota_B 4412
## # ℹ 27 more rows
# Select the cut-off (e.g. 1% of total count)
cutoff <- 0.01 * sum(MeltData.df$Abundance)
# Select low-abundant Phyla (with total counts below the cutoff)
lowAbundant <- Phyla.df[Phyla.df$Count <= cutoff,]$Phylum
# Substitute Phylum names to "<1%" for the low-abundant phyla
MeltData.df[MeltData.df$Phylum %in% lowAbundant,]$Phylum <- '<1%'
# Make stacked barplot
ggplot(MeltData.df, aes(x=Sample, y=Abundance, fill=Phylum)) +
geom_bar(aes(), stat="identity", position="stack") +
theme(legend.position="bottom") +
ggtitle("Taxonomy (Phyla) per sample") +
guides(fill=guide_legend(nrow=3)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(y= "Relative abundance", x = NULL) +
theme(legend.position = "bottom") +
guides(fill=guide_legend(ncol=3))
# Clean-up
rm(MyData.phy, MeltData.df, Phyla.df, cutoff, lowAbundant)
Phyloseq function merge_samples() allows to combine samples, so we can compare groups of samples. In this example we will combine soil samples by the Prior Land Use.
Important: after merging samples, the data should be re-normalised!
# Merge samples in groups (warnings are OK)
landUse.phy <- merge_samples(BacIndAgriClassAbundNorm.phy, "Former_landuse")
landUse.phy
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 76 taxa and 2 samples ]
## sample_data() Sample Data: [ 2 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 76 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 76 tips and 75 internal nodes ]
# Re-Normalize after merging samples !
landUse.phy = transform_sample_counts(landUse.phy, function(x) x / sum(x))
# Plot relative taxonomic abundances
plot_bar(landUse.phy, fill = "Phylum") +
geom_bar(aes(color=Phylum, fill=Phylum), stat="identity", position="stack") +
theme(legend.position = "bottom") +
ggtitle("Taxonomy (Phyla) per land use") +
guides(fill=guide_legend(ncol=3))
# Clean-up
rm(landUse.phy)
Optional task: Make barplot combining low-abundant taxa from the landUse.phy
Combining Phyloseq and ggplot2 allow making many other taxonomy plots. Below is just a couple of examples: standard (non-stacked) barplots and heatmaps.
Non-stacked barplots
# Select a Phylum (e.g. Acidobacteriota)
acidobacteriota.phy <- subset_taxa(BacIndAgriClassAbundNorm.phy,
Phylum == "Acidobacteriota")
acidobacteriota.phy
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 8 taxa and 300 samples ]
## sample_data() Sample Data: [ 300 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 8 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 8 tips and 7 internal nodes ]
# Plot relative abundance of classes within this Phylum
plot_bar(acidobacteriota.phy, x="Class", fill = "Class") +
geom_bar(aes(color=Class, fill=Class), stat="identity", position="stack") +
ggtitle("Classes in Acidobacteriota Phylum") +
theme(legend.position = "none")
# Clean-up
rm(acidobacteriota.phy)
Heatmaps
See ?distanceMethodList for available
distances.
See ?ordinate about available ordination
methods.
plot_heatmap(BacIndAgriClassAbundNorm.phy,
method = "PCoA", distance = "unifrac",
taxa.label = "Class", taxa.order = "Phylum",
na.value="white", low="beige", high="red",
title="Heatmap", max.label = 100)
It could be noted that R has a wide range of libraries for plotting
hetamps, e.g. see some examples here: https://www.datanovia.com/en/lessons/heatmap-in-r-static-and-interactive-visualization/
You dont have to use Phyloseq::plot_heatmap() function
to make heatmaps with microbial data. You may use
Phyloseq just to prepare and extract the feature table,
and then apply any other R heatmap package of your choice.
Alpha diversity describes the richness and evenness of microbial community within a single sample.
There are many measures of Alpha diversity. Here we plot Chao1, Shannon and Simpson indices. Use ?plot_richness to see what other measures are available in Phyloseq.
plot_richness(BacIndAgriClassAbundNorm.phy,
measures=c("Chao1", "Shannon", "Simpson"),
color="Former_landuse", x="Former_landuse",
title="Alpha diversity vs Former Land Use") +
geom_boxplot()
In the plot above you may see the trend for a higher Alpha
diversity in Post-Industrial than in Post-Agricultural soils.
Was this difference significant?
Because of the compositionality (relative nature) and
high sparsity of the standard statistical tests
assuming normality of distributions usually are not appropriate for
microbial data. Instead, it is recommended to use
non-parametric or specialized and
permutation-based tests.
In this example we only compare two groups: Agriculture vs Industrial land use. So, we just apply Wilcoxon Rank Sum test, which is appropriate for comparing two groups. If there were more than two groups to compare (e.g. if we kept the Rewilding and Ancient Woodland samples) the pairwise.wilcox.test() or pairwise Kruskall-Wallis test could be used including the appropriate adjustments for multiple testing.
Importantly: these tests are implemented outside of Phyloseq package.
# Calculate Alpha diversity indices
alpha_richness = estimate_richness(
BacIndAgriClassAbundNorm.phy, measures = c("Chao1", "Shannon", "Simpson"))
head(alpha_richness)
## Chao1 se.chao1 Shannon Simpson
## WP001 50 0 2.344960 0.8547996
## WP002 49 0 2.392025 0.8640416
## WP003 42 0 2.420025 0.8390369
## WP004 55 0 2.682299 0.8971910
## WP005 60 0 2.712531 0.8952492
## WP006 47 0 2.307757 0.8426350
# Apply Wilcoxon test
wilcox.test(alpha_richness$Shannon~
sample_data(BacIndAgriClassAbundNorm.phy)$Former_landuse)
##
## Wilcoxon rank sum test with continuity correction
##
## data: alpha_richness$Shannon by sample_data(BacIndAgriClassAbundNorm.phy)$Former_landuse
## W = 8142, p-value = 3.527e-05
## alternative hypothesis: true location shift is not equal to 0
# Clean-up
rm(alpha_richness)
Is there a significant difference in soil microbial Alpha diversity depending on the prior land use in the RestREco dataset?
Term “ordination” term in ecology analysis is used for dimentionality reduction and visualization (e.g. techniques similar to PCA). The idea is to make a 2D (or 3D) image that would reflect similarity between samples (or taxa), in such a way that similar samples (or taxa) are located close on the image; while the dissimilar samples (or taxa) are drown far apart.
Technically, an (unconstrained) ordination procedure includes two main steps:
In the example below we use Bray-Curtis distance and PCoA method for making the plot. See help to the ordinate() function (?ordinate) to find information about other available distances and methods.
In this practical we will consider the unconstrained ordination only. The constrained ordination includes an additional step that allows to add some environmental gradients (e.g. soil pH) to the plot: so we can see how the samples (or taxa) are positioned along these gradients.
# Calculating the ordination (will be used later for plotting)
ord <- ordinate(BacIndAgriClassAbundNorm.phy,
distance = "bray",
method = "PCoA")
On this plot, the taxa located close to each other are frequently found in the same samples. In my experience, plot_ordination() doesn’t handle the colours well, so in the chunk below I manage the colours manually.
# --- Make colours for Phyla --- #
# Get names of Phyla present in the dataset
phyla <- unique(data.frame(tax_table(BacIndAgriClassAbundNorm.phy))$Phylum)
# Make a colour for each phyla
library(Polychrome)
phyla_cols <- createPalette(length(phyla), c("#0000FF", "#FF0000"))
names(phyla_cols) <- phyla
#phyla_cols
#swatch(phyla_cols)
# --- Make Plot --- #
plot_ordination(BacIndAgriClassAbundNorm.phy, ord, type="taxa",
color="Phylum", label="Class") +
ggtitle("Taxa ordination: Classes",
subtitle="Coloured by Phyla, PCoA with Bray-Curtis distances")+
geom_point(size=3) +
scale_color_manual(values = phyla_cols) +
theme_bw() +
theme(legend.position = "bottom") +
guides(color=guide_legend(ncol=3))
# Clean-up
rm(phyla, phyla_cols)
Beta diversity, in effect, is an unconstrained ordination of samples. Phyloseq allows to plot it using the same plot_ordination() function that has been used above for plotting taxa. Of course, for making the ordination plot for samples you should set type=“samples”. The samples located closely on the plot have similar microbial communities, the samples located far apart have very distinct microbial communities.
plot_ordination(BacIndAgriClassAbundNorm.phy, ord, type="samples",
color="Former_landuse") + geom_point(size=3) +
stat_ellipse() +
ggtitle("Beta diversity",
subtitle="Unconstrained ordination of samples, PCoA with Bray-Curtis distances")+
theme_bw()
# Clean-up
rm(ord)
Judging by this Beta diversity plot: are Post-Industrial and Post-Agricultural microbial communities different?
Beta diversity may also be visualized using hierarchical clustering dendrograms. The code chunk below illustrates how to plot such deprograms using plain R methods (with minimal use of Phyloseq functions).
Hierarchical clustering
# Calculate distance matrix (using phyloseq)
bray_dist <- distance(BacIndAgriClassAbundNorm.phy, method="bray")
as.matrix(bray_dist)[1:5,1:5]
## WP001 WP002 WP003 WP004 WP005
## WP001 0.00000000 0.09555522 0.2874471 0.21405539 0.18252772
## WP002 0.09555522 0.00000000 0.2352294 0.15354710 0.16339692
## WP003 0.28744707 0.23522941 0.0000000 0.18898148 0.19497310
## WP004 0.21405539 0.15354710 0.1889815 0.00000000 0.07829782
## WP005 0.18252772 0.16339692 0.1949731 0.07829782 0.00000000
# Perform hierarchical clustering (outside phyloseq)
bray_clust <- hclust(bray_dist, method="average")
Plot dendrogram
The plotting uses R functions outside of Phyloseq.
# For colouring dendrograms
library(dendextend)
# Extract metadata from Phyloseq object
metadata <- data.frame(sample_data(BacIndAgriClassAbundNorm.phy))
# Make the dendrogram object
bray_dend <- as.dendrogram(bray_clust, hang=0.1)
# Get samples ordered as in the deprogram
dend_samples <- get_leaves_attr(bray_dend, "label")
# Prepare colours foe samples (by former land use)
landuse_colours <- c('green','red')[
match(metadata$Former_landuse, c('Agriculture','Industrial'))]
names(landuse_colours) <- rownames(metadata)
landuse_colours <- landuse_colours[dend_samples]
# Set colours of terminal branches (h=1)
bray_dend <- color_branches(bray_dend,
k=length(dend_samples), h=1,
landuse_colours)
# Draw plot
plot(bray_dend, main="Soil amples (coloured by former land use)",
ylab="Distances", leaflab = "none")
# Add legend
legend("topright",
legend=c('Agriculture','Industrial'),
col=c('green','red'),
bty="n",lty=1, cex=0.8)
# Clean-up
rm(bray_dend, dend_samples, bray_clust, landuse_colours)
On the plot above you may note that, overall, the samples are grouped into separate clusters, with Agriculture and Industrial samples not mixing much with each other. However this is clearly less illustrative than the samples separation shown on 3D visualization plots in the lecture.
The 3D, 2D plots, and even the dendrogram above suggested that, on average, the soil microbial communities may differ between Post-Agricultural and Post-Industrial samples. As it was noted earlier, the microbial data requires specialized tests, which are implemented outside of the Phyloseq package. The most common test used for estimating statistical significance for Beta diversity is PERMANOVA (PERmutation based Multivariate ANOVA) as implemented in adonis2() function from the vegan package.
Vegan package is as popular as Phyloseq in analysis of ecology data. However, Vegan includes more statistical tests, and puts less focus on ggplot2-compartible visualization or data manipulation (see https://vegandevs.github.io/vegan )
In the example below you may note this formula within the
vegan::adonis2() function:
~Region/Former_landuse/Site_code.
This is a way of describing multi-layer nested designs in R.
Also, you may note the strata=Region term: it does not allow permutations between regions.
More detailed discussion of restricted permutations and nested designs is beyond the scope of this introductory course (although some references are given below).
library(vegan)
adonis2(bray_dist ~ Region/Former_landuse/Site_code,
strata = metadata$Region,
data = metadata, permutations=1000)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Blocks: strata
## Permutation: free
## Number of permutations: 1000
##
## adonis2(formula = bray_dist ~ Region/Former_landuse/Site_code, data = metadata, permutations = 1000, strata = metadata$Region)
## Df SumOfSqs R2 F Pr(>F)
## Region 1 2.4290 0.21067 274.8635 0.000999 ***
## Region:Former_landuse 2 2.4437 0.21194 138.2635 0.000999 ***
## Region:Former_landuse:Site_code 56 4.5363 0.39344 9.1665 0.000999 ***
## Residual 240 2.1209 0.18395
## Total 299 11.5298 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rm(metadata, bray_dist)
This result of PERMANOVA test shows that all three studied factors have significant effect on the microbial community: the Region, the Prior Land Use, and the Sites within each Region/ Land Use group. The R2 column shows the relative contributions of each factor into the total variance of the microbial communities.
Refresher about nested designs
https://www.nature.com/articles/nmeth.3137
About R formulas for nested designs
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
https://www.jstor.org/stable/2346786
Breifly:
If factor B is nested within factor A (e.g. we have multiple sites
within each type of the land use), then the typical ANOVA (or
regression) R formula could be written in the follwoing way:
A + B %in% A This means that we are interested in
effect of factor A abd factor B nested within A.
An equivalent way of writing this formula is
A + A:B
For R experts, it highlights that, in fact, we a looking for effects of
A and interaction of A with B.
The syntax A/B (used in our code) is a shortcut for
A + B %in% A
About vegan::adonis2() and restricted permutations
https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/permanova/
https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/restricting-permutations/
In addition to the plots illustrating Taxonomy, Phylogeny, Alpha and Beta diversity, Phyloseq may also draw the network plots. In the network plots the nodes (that could be samples or taxa) are connected by edges. The edges thickness indicates the degree of similarity.
Importantly, the location of the nodes in the network plot does not always reflect their similarity. While many algorithms attempt to place similar nodes close, the other options are also frequently used. In phyloseq::plot_net() function, the location of nodes is defined by the “laymeth” parameter, which may take the following values: “auto”, “random”, “circle”, “sphere”, “fruchterman.reingold”, “kamada.kawai”, “spring”, “reingold.tilford”, “fruchterman.reingold.grid”, “lgl”, “graphopt” and “svd”.
As usual, Phyloseq only provides the plotting functions. Further quantitative network analysis, like assessment of connectivity, centrality, etc would require other specialized tools.
The plot for the samples network below suggests that soil replicas from the same sites were similar to each other.
plot_net(BacIndAgriClassAbundNorm.phy, type = "samples",
distance = "bray", maxdist = 0.1,
color="Site_code", point_label="Site_code") +
ggtitle("Samples (colored by Sites)")
The taxa co-abundances networks should be interpreted with caution: the connections between taxa on the plot should NOT be taken as an evidence of biological or physical interaction.
# Taxa plot with default layout of nodes
plot_net(BacIndAgriClassAbundNorm.phy, type = "taxa",
distance = "(A+B-2*J)/(A+B)", maxdist = 0.1,
color="Phylum", point_label="Class") +
ggtitle("Phyla (default layout)")
# Taxa plot with circular location of nodes
plot_net(BacIndAgriClassAbundNorm.phy, type = "taxa",
distance = "(A+B-2*J)/(A+B)", maxdist = 0.1, laymeth="circle",
color="Phylum", point_label="Class") +
ggtitle("Phyla (circular layout)")
ANCOMBC package implements a method for differential abundance analysis, which takes into account the compositionality of microbial data: https://www.nature.com/articles/s41467-020-17041-7
Like Vegan and Phyloseq, ANCOMBC is a very popular package, and it can directly accept the Phyloseq objects.
# Lad package
library(ANCOMBC)
# Estimate differential abundances
# ~5min, multiple warnings are OK
ancom.output = ancombc2(data = BacIndAgriClassAbundNorm.phy, tax_level = "Class",
fix_formula = "Former_landuse",
group = "Former_landuse",
p_adj_method = "holm", alpha = 0.05, pairwise = TRUE,
prv_cut = 0.10, lib_cut = 1000,
n_cl = 4, verbose = TRUE)
# Clean-up
detach("package:ANCOMBC", unload=TRUE) # unload ANCOMBC package
The ancombc2 output contains many components that we don’t need for the downstream analysis. We only need the table with results. Moreover, within this table we will only use selected columns describing Former Landuse (ignoring the data about intercept).
The columns that we need are:
# Extract the results table
results.df <- ancom.output$res
colnames(results.df)
## [1] "taxon" "lfc_(Intercept)" "lfc_Former_landuseIndustrial" "se_(Intercept)" "se_Former_landuseIndustrial" "W_(Intercept)" "W_Former_landuseIndustrial" "p_(Intercept)" "p_Former_landuseIndustrial" "q_(Intercept)" "q_Former_landuseIndustrial" "diff_(Intercept)" "diff_Former_landuseIndustrial" "passed_ss_(Intercept)" "passed_ss_Former_landuseIndustrial"
dim(results.df) # Taxa at Class level
## [1] 76 15
# Select only columns that we need
all_results.df <- results.df %>%
select(taxon, contains("Former_landuse"))
colnames(all_results.df)
## [1] "taxon" "lfc_Former_landuseIndustrial" "se_Former_landuseIndustrial" "W_Former_landuseIndustrial" "p_Former_landuseIndustrial" "q_Former_landuseIndustrial" "diff_Former_landuseIndustrial" "passed_ss_Former_landuseIndustrial"
dim(all_results.df)
## [1] 76 8
head(all_results.df)
## taxon lfc_Former_landuseIndustrial se_Former_landuseIndustrial W_Former_landuseIndustrial p_Former_landuseIndustrial q_Former_landuseIndustrial diff_Former_landuseIndustrial passed_ss_Former_landuseIndustrial
## 1 Verrucomicrobiae -0.835731485 0.08234398 -10.14927295 5.545994e-21 4.159495e-19 TRUE TRUE
## 2 Mor1 0.221208208 0.10531749 2.10039382 3.682299e-02 1.000000e+00 FALSE TRUE
## 3 Phycisphaerae 0.055095876 0.09080649 0.60673940 5.444858e-01 1.000000e+00 FALSE TRUE
## 4 Planctomycetia -0.473222220 0.08363992 -5.65785098 3.597050e-08 2.374053e-06 TRUE TRUE
## 5 B15-G4 -0.077196631 0.09856533 -0.78320270 4.342676e-01 1.000000e+00 FALSE TRUE
## 6 UBA8108 0.003872749 0.07839781 0.04939868 9.607713e-01 1.000000e+00 FALSE FALSE
# Clean-up
rm(results.df, ancom.output)
Only the taxa that passed the additional ANCOMBC check should be considered as differentially abundant.
# Select differentially abundant taxa
dif_results.df <- all_results.df %>%
filter(diff_Former_landuseIndustrial & passed_ss_Former_landuseIndustrial) %>%
arrange(desc(lfc_Former_landuseIndustrial))
dim(dif_results.df) # 18 differentially abundant taxa
## [1] 18 8
dif_results.df
## taxon lfc_Former_landuseIndustrial se_Former_landuseIndustrial W_Former_landuseIndustrial p_Former_landuseIndustrial q_Former_landuseIndustrial diff_Former_landuseIndustrial passed_ss_Former_landuseIndustrial
## 1 Blastocatellia 0.8541556 0.11216210 7.615368 4.024883e-13 2.857667e-11 TRUE TRUE
## 2 Chloroflexia 0.6242052 0.12587898 4.958772 1.197188e-06 7.063409e-05 TRUE TRUE
## 3 UBA9160 0.5836522 0.13531146 4.313398 2.432357e-05 1.313473e-03 TRUE TRUE
## 4 Limnocylindria 0.3712548 0.08937539 4.153882 4.273923e-05 2.265179e-03 TRUE TRUE
## 5 Gammaproteobacteria 0.2580406 0.06908612 3.735057 2.248424e-04 1.169180e-02 TRUE TRUE
## 6 Desulfitobacteriia -0.3413243 0.09549417 -3.574294 4.514943e-04 2.257471e-02 TRUE TRUE
## 7 Alphaproteobacteria -0.3712055 0.07300500 -5.084658 6.520492e-07 4.042705e-05 TRUE TRUE
## 8 Cyanobacteriia -0.3921831 0.08413241 -4.661498 8.651602e-06 4.844897e-04 TRUE TRUE
## 9 Planctomycetia -0.4732222 0.08363992 -5.657851 3.597050e-08 2.374053e-06 TRUE TRUE
## 10 Nitrospiria -0.5092767 0.10193599 -4.996044 9.989573e-07 5.993744e-05 TRUE TRUE
## 11 Holophagae -0.5703991 0.09434919 -6.045618 1.263810e-08 8.467527e-07 TRUE TRUE
## 12 Polyangia_437813 -0.5730558 0.09543608 -6.004603 8.778190e-09 5.969169e-07 TRUE TRUE
## 13 Negativicutes -0.5822849 0.09555270 -6.093861 4.734435e-09 3.266760e-07 TRUE TRUE
## 14 Acidobacteriae -0.6519240 0.12569485 -5.186561 3.963850e-07 2.533303e-05 TRUE TRUE
## 15 Verrucomicrobiae -0.8357315 0.08234398 -10.149273 5.545994e-21 4.159495e-19 TRUE TRUE
## 16 Rubrobacteria -0.9065683 0.08133051 -11.146718 1.152850e-14 8.300523e-13 TRUE TRUE
## 17 Clostridia_258483 -1.1044127 0.12053246 -9.162783 8.569081e-18 6.341120e-16 TRUE TRUE
## 18 Bacilli -1.7215222 0.10558230 -16.305026 3.635168e-43 2.762728e-41 TRUE TRUE
# Add to the data frame the column with direction of change
dif_results.df = dif_results.df %>%
mutate(direct = ifelse(lfc_Former_landuseIndustrial > 0, "Positive LFC", "Negative LFC"))
# Update factors levels (may affect plot's layout)
dif_results.df$direct =
factor(dif_results.df$direct, levels = c("Positive LFC", "Negative LFC"))
dif_results.df$taxon =
factor(dif_results.df$taxon, levels = dif_results.df$taxon)
# Make plot
dif_results.df %>%
ggplot(aes(x = taxon, y = lfc_Former_landuseIndustrial, fill = direct)) +
geom_bar(stat = "identity", width = 0.7, color = "black",
position = position_dodge(width = 0.4)) +
geom_errorbar(aes(ymin = lfc_Former_landuseIndustrial - se_Former_landuseIndustrial,
ymax = lfc_Former_landuseIndustrial + se_Former_landuseIndustrial),
width = 0.2, position = position_dodge(0.05),
color = "black") +
labs(x = "Class", y = "Log fold change") +
ggtitle(label = "Differentially abundant taxa",
subtitle="Prior land use: Industrial vs Agricultural") +
scale_fill_discrete(name = NULL) +
scale_color_discrete(name = NULL) +
theme_bw() +
theme(panel.grid.minor.y = element_blank(),
axis.text.x = element_text(size = 6, angle = 60, hjust = 1))
# Clean-up
rm(all_results.df)
A number of microbial classes showed significantly different abundance between Post-Industrial and Post-Agricultural soil samples. This result confirms the previous Beta diversity findings suggesting that the prior lend use has a strong effect on soil microbial community in RestREco dataset.
Now we will build a model to detect the prior land use from the microbial community composition.
During this course you have already learned different ML algorithms that could be used for such classification task. For instance, you could use the differentially abundant taxa in the logistic regression (because we only have two classes: Agricultural and Industrial use). However, my personal opinion is that Random Forest (RF) might perform better for this dataset. In real life, you would build models using several different techniques, and compare their performance.
Because RF can detect the features that best perform for classification, we will use the entire dataset, rather then the differentially abundant taxa only.
library(ggplotify)
library(ggpubr)
library(mixOmics)
library(caret)
library(randomForest)
library(parallelMap)
counts.df <- data.frame(otu_table(BacIndAgriClassAbundNorm.phy))
samples.df <- data.frame(sample_data(BacIndAgriClassAbundNorm.phy))
taxa.df <- data.frame(tax_table(BacIndAgriClassAbundNorm.phy))
# Confirm that row names in taxa and counts are the same
all(rownames(counts.df) == rownames(taxa.df))
## [1] TRUE
# Assign new rownames to counts
rownames(counts.df) <- taxa.df$Class
# Clean-up
rm(taxa.df)
The variables (taxa) should be the columns and the samples should be the rows for the downstream analysis.
counts.df <- t(counts.df)
counts.df <- as.data.frame(counts.df)
An additional column with class should be added to counts for the downstream analysis.
# Confirm that row names in samples and (transposed) counts are the same
all(rownames(counts.df) == rownames(samples.df))
## [1] TRUE
# Add the column with class to counts data frame
data <- data.frame(class=samples.df$Former_landuse,counts.df)
# Clean-up
rm(samples.df, counts.df)
Caret package provides convenience functions to split dataset into training and validation sub-sets, while preserving the proportions of the classes.
# count number of samples in each class
table(data$class)
##
## Agriculture Industrial
## 150 150
# Make index for splitting (60% training set, 40% test set)
set.seed(42)
trainIndex <- createDataPartition(data$class, p = .6,
list = FALSE,
times = 1)
# Split
trainSet <- data[ trainIndex,]
testSet <- data[-trainIndex,]
# Check balance of classes in the test and training sets
table(trainSet$class)
##
## Agriculture Industrial
## 90 90
table(testSet$class)
##
## Agriculture Industrial
## 60 60
# Clean-up
rm(data,trainIndex)
Again, using caret convenience functions to center and scale predictors.
# Autoscaling using Caret (without the class column)
preProcValues <- caret::preProcess(trainSet[,-1], method = c("center", "scale"))
# Apply the scaling to both the training set and test set (-class)
# using the same scaling parameters (i.e. mean and sd) for both
trainTransformed <- predict(preProcValues, trainSet[,-1])
testTransformed <- predict(preProcValues, testSet[,-1])
# Return class column to the transformed data
trainTransformed <- cbind(class=trainSet$class, trainTransformed)
testTransformed <- cbind(class=testSet$class, testTransformed)
# Clean-up
rm(preProcValues, trainSet, testSet)
The code chunk below use caret interface to RF, which allows easy syntax to set training parameters. It is expected that you already practiced caret during this course before. Also, you may see references to caret and RF at the end of this practical handout.
For the sake of time, we use only a small dataset for this practical session. So the training will be quick (also, RF is a relatively fast ML method). However, in real life training of ML models may take long time. So parallelization becomes important. Here we use parallelMap R package that is well integrated with caret for training ML models.
# Set Parameter Tuning
control <- trainControl(method='repeatedcv', number=10, repeats=10, search='grid')
# Alternate Tuning Grids
grid <- expand.grid(mtry= seq(10, 200, 10))
# Start parallelization
parallelStartSocket(cpus = parallel::detectCores())
# train model (this step takes a bit of time)
# class has to be set as factor
model_rf <- train(as.factor(class)~., data=trainTransformed,
method="rf", metric = 'Accuracy',
ntrees = 800, tuneGrid = grid,
trControl = control )
# Stop parallelization
parallelStop()
# Check results
model_rf
## Random Forest
##
## 180 samples
## 76 predictor
## 2 classes: 'Agriculture', 'Industrial'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 162, 162, 162, 162, 162, 162, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 10 0.9566667 0.9133333
## 20 0.9461111 0.8922222
## 30 0.9433333 0.8866667
## 40 0.9333333 0.8666667
## 50 0.9311111 0.8622222
## 60 0.9244444 0.8488889
## 70 0.9233333 0.8466667
## 80 0.9238889 0.8477778
## 90 0.9238889 0.8477778
## 100 0.9222222 0.8444444
## 110 0.9233333 0.8466667
## 120 0.9233333 0.8466667
## 130 0.9227778 0.8455556
## 140 0.9227778 0.8455556
## 150 0.9227778 0.8455556
## 160 0.9227778 0.8455556
## 170 0.9227778 0.8455556
## 180 0.9211111 0.8422222
## 190 0.9205556 0.8411111
## 200 0.9244444 0.8488889
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
# Clean-up
rm(control, grid, trainTransformed)
Now we have a forest of decision trees, that are supposed to predict Prior Land Use from the abundance of microbial classes detected in the soil samples.
How can we evaluate that these RF predictions are true?
Good that we have the test set that we initially reserved in our data: it was not used for training, and we know the true labels in this test set. So, we may use the microbial abundancies to predict the classes in the test set, and then compare the predicted classes with the true ones. Such comparison is called evaluting confusion matrix”.
# Predict classes from microbial taxes in the test set
predict.rf <- predict(model_rf, testTransformed)
# Calculate confusion matrix (note that class should be set as factor)
confusion.matrix <- confusionMatrix(predict.rf,
as.factor(testTransformed$class),
positive="Industrial")
# Show the confusion matrix
confusion.matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction Agriculture Industrial
## Agriculture 57 2
## Industrial 3 58
##
## Accuracy : 0.9583
## 95% CI : (0.9054, 0.9863)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9167
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9667
## Specificity : 0.9500
## Pos Pred Value : 0.9508
## Neg Pred Value : 0.9661
## Prevalence : 0.5000
## Detection Rate : 0.4833
## Detection Prevalence : 0.5083
## Balanced Accuracy : 0.9583
##
## 'Positive' Class : Industrial
##
# Clean-up
rm(predict.rf, testTransformed)
Has our RF model performed well?
One of the advantages of the RF method is that it allows to estimate how different predictors contributed into the final classification. Thus, each Random Tree in the Forest puts all predictors in the order of their importance within this tree. Then, the average rank of a predictor over all the Trees gives its overall rank for the entire Forest.
The code chunk below illustrates how this information can be extracted from the RF model that we have built.
# Get taxa importance from the model
varImp.res <- varImp(model_rf)
varImp.df <- varImp.res$importance
# Look at the top informative taxa
varImp.df %>%
arrange(desc(Overall)) %>%
head(10)
## Overall
## Bacilli 100.000000
## Gammaproteobacteria 52.067749
## Verrucomicrobiae 34.110343
## Polyangia_463783 27.421908
## Limnocylindria 22.528468
## Clostridia_258483 20.283050
## Ktedonobacteria 17.939605
## Blastocatellia 17.136488
## Actinomycetia 14.267658
## Acidimicrobiia_401430 9.974571
# Order the taxa by importance (for plotting)
varImp.df <- cbind(taxa=rownames(varImp.df), varImp.df)
varImp.df <- varImp.df[ order(varImp.df$Overall), ]
# Make the plot
ggbarplot(varImp.df, "taxa", "Overall",
orientation = "horiz",fill = "#619CFF") +
theme_bw()
# Clean-up
rm(varImp.res)
How many of the top 10 taxa most informative for RF model are differentially expressed by ANCOMBC2?
You have completed this practical session!
QIIME2 and BIOM:
https://qiime2.org
https://biom-format.org
Phyloseq:
https://joey711.github.io/phyloseq
https://www.yanh.org/2021/01/01/microbiome-r/
https://micca.readthedocs.io/en/latest/phyloseq.html
https://vaulot.github.io/tutorials/Phyloseq_tutorial.html
Vegan and PERMANOVA:
https://vegandevs.github.io/vegan/
https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/permanova/
ANCOMBC2:
https://www.nature.com/articles/s41467-020-17041-7
https://bioconductor.org/packages/release/bioc/vignettes/ANCOMBC/inst/doc/ANCOMBC2.html
Caret and Random Forest:
https://topepo.github.io/caret/
https://topepo.github.io/caret/train-models-by-tag.html#random-forest
ls()
## [1] "BacIndAgriClassAbundNorm.phy" "confusion.matrix" "dif_results.df" "model_rf" "varImp.df"
sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 LC_CTYPE=English_United Kingdom.utf8 LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] parallelMap_1.5.1 randomForest_4.7-1.1 caret_6.0-94 mixOmics_6.24.0 MASS_7.3-60 ggpubr_0.6.0 ggplotify_0.1.2 vegan_2.6-4 lattice_0.21-8 permute_0.9-7 dendextend_1.17.1 Polychrome_1.5.1 phyloseq_1.44.0 ggplot2_3.5.0 dplyr_1.1.2
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.3 matrixStats_1.2.0 bitops_1.0-7 DirichletMultinomial_1.42.0 lubridate_1.9.3 RColorBrewer_1.1-3 httr_1.4.7 doParallel_1.0.17 numDeriv_2016.8-1.1 tools_4.3.1 doRNG_1.8.6 backports_1.4.1 utf8_1.2.3 R6_2.5.1 DT_0.32 lazyeval_0.2.2 mgcv_1.9-0 rhdf5filters_1.12.1 withr_3.0.0 gridExtra_2.3 cli_3.6.1 Biobase_2.60.0 sandwich_3.1-0 labeling_0.4.3 sass_0.4.8 mvtnorm_1.2-4 proxy_0.4-27 yulab.utils_0.1.4 foreign_0.8-84 scater_1.28.0 parallelly_1.37.1 decontam_1.20.0
## [33] readxl_1.4.3 rstudioapi_0.15.0 RSQLite_2.3.5 generics_0.1.3 gridGraphics_0.5-1 gtools_3.9.5 car_3.1-2 Matrix_1.6-0 biomformat_1.28.0 ggbeeswarm_0.7.2 fansi_1.0.4 DescTools_0.99.54 S4Vectors_0.38.1 DECIPHER_2.28.0 abind_1.4-5 lifecycle_1.0.4 scatterplot3d_0.3-44 multcomp_1.4-25 yaml_2.3.7 carData_3.0-5 SummarizedExperiment_1.30.2 recipes_1.0.10 rhdf5_2.44.0 grid_4.3.1 blob_1.2.4 crayon_1.5.2 beachmat_2.16.0 pillar_1.9.0 knitr_1.45 GenomicRanges_1.52.1 boot_1.3-28.1 gld_2.6.6
## [65] corpcor_1.6.10 future.apply_1.11.1 codetools_0.2-19 glue_1.6.2 data.table_1.14.8 MultiAssayExperiment_1.26.0 vctrs_0.6.3 png_0.1-8 treeio_1.24.3 Rdpack_2.6 cellranger_1.1.0 gtable_0.3.4 cachem_1.0.8 gower_1.0.1 xfun_0.39 prodlim_2023.08.28 rbibutils_2.2.16 S4Arrays_1.0.6 survival_3.5-5 timeDate_4032.109 SingleCellExperiment_1.22.0 iterators_1.0.14 hardhat_1.3.1 lava_1.7.3 gmp_0.7-4 TH.data_1.1-2 ipred_0.9-14 nlme_3.1-162 bit64_4.0.5 GenomeInfoDb_1.36.4 bslib_0.6.1 irlba_2.3.5.1
## [97] vipor_0.4.7 rpart_4.1.19 colorspace_2.1-0 BiocGenerics_0.46.0 DBI_1.2.2 Hmisc_5.1-1 nnet_7.3-19 ade4_1.7-22 NADA_1.6-1.1 Exact_3.2 tidyselect_1.2.0 bit_4.0.5 compiler_4.3.1 htmlTable_2.4.2 BiocNeighbors_1.18.0 expm_0.999-9 DelayedArray_0.26.7 checkmate_2.3.1 scales_1.3.0 stringr_1.5.1 digest_0.6.33 minqa_1.2.6 rmarkdown_2.25 XVector_0.40.0 htmltools_0.5.7 pkgconfig_2.0.3 base64enc_0.1-3 lme4_1.1-35.1 sparseMatrixStats_1.12.2 MatrixGenerics_1.12.3 highr_0.10 fastmap_1.1.1
## [129] rlang_1.1.1 htmlwidgets_1.6.4 BBmisc_1.13 zCompositions_1.5.0-1 DelayedMatrixStats_1.22.6 farver_2.1.1 jquerylib_0.1.4 zoo_1.8-12 jsonlite_1.8.7 energy_1.7-11 BiocParallel_1.34.2 ModelMetrics_1.2.2.2 BiocSingular_1.16.0 RCurl_1.98-1.12 magrittr_2.0.3 Formula_1.2-5 scuttle_1.10.3 GenomeInfoDbData_1.2.10 Rhdf5lib_1.22.0 munsell_0.5.0 Rcpp_1.0.11 ape_5.7-1 viridis_0.6.5 CVXR_1.0-12 pROC_1.18.5 stringi_1.7.12 rootSolve_1.8.2.4 zlibbioc_1.46.0 plyr_1.8.8 listenv_0.9.1 parallel_4.3.1 ggrepel_0.9.5
## [161] lmom_3.0 Biostrings_2.68.1 splines_4.3.1 multtest_2.56.0 igraph_1.5.0 ggsignif_0.6.4 rngtools_1.5.2 reshape2_1.4.4 stats4_4.3.1 ScaledMatrix_1.8.1 evaluate_0.23 nloptr_2.0.3 foreach_1.5.2 tidyr_1.3.0 purrr_1.0.1 future_1.33.1 rsvd_1.0.5 broom_1.0.5 Rmpfr_0.9-5 RSpectra_0.16-1 e1071_1.7-14 tidytree_0.4.6 rstatix_0.7.2 viridisLite_0.4.2 class_7.3-22 gsl_2.1-8 rARPACK_0.11-0 truncnorm_1.0-9 tibble_3.2.1 lmerTest_3.1-3 ellipse_0.5.0 memoise_2.0.1
## [193] beeswarm_0.4.0 IRanges_2.34.1 cluster_2.1.4 TreeSummarizedExperiment_2.8.0 timechange_0.3.0 globals_0.16.2 mia_1.8.0
date()
## [1] "Tue Mar 19 20:19:27 2024"